DRAFT - Protocols for de Novo whole genome assembly, annotation, and identification of strains (v2022.0.0)
Assembly, annotation, sppID, and characterization
1 Protocol Notes
Note: This is an html document and best viewed in a web browser.
We are working with endophytes, microorganisms that colonize the internal tissues of plants. We have isolated many strains of these endophytes from native, wild plants and are studying them for their ability to improve host-plant tolerance to different environmental stresses. We perform experiments to qualify and quantify a range of potentially beneficial traits using empirical methods. We also test whether the endophytes can be introduced to a novel host, and whether the new host will benefit - which is measured by improved growth/productivity in the host under environmental stress.
Terminology: Strain vs. Species
Reviews on endophytes: (Reinhold-Hurek and Hurek 2011; Hardoim et al. 2015)
Because phenotypes (observed traits) are explained by genotypes (genetic traits), our investigations would benefit from having the DNA sequences of the endophytes. This information can inform both our applied and basic science experiments.
Accordingly, we have sequenced several of our endophytes and received the raw sequence reads as data files. We will use those files to assemble, annotate, and identify these strains. In subsequent protocols we will mine the genomes for genes and pathways of interest.
Whole genome sequence (WGS) assembly is the process whereby raw sequence reads are assembled into a draft genome. The assemblies are typically completed either by read mapping, which uses a high-quality reference genome of a closely related species as a guide to assemble the draft genome, or de Novo, where the draft genome is assembled using only the information contained in the sequence reads.
The best analogy of a de Novo assembly that I have seen is that it is like putting a book back together after it has been through a shredder - without having an original copy of the un-shredded book. Both protocols have their pros and cons but we will be performing de Novo assemblies because we are working with the genomes of environmental strains which are less well curated than the genomes of clinical strains, and de Novo assemblies have less bias.
Recommended literature:
These are introductory papers for the WGS process. Associated papers for each step are included within that section.
This document is for de Novo WGS using Illumina paired-end reads, and contains our current protocols for assembly, quality control, annotation, and strain identification. For each step there are many alternative protocols available. This document contains only the protocols that we are currently using.
In order to maintain Reproducible research it is important to record the software version and the date used for each step in the protocol. Updates and new versions can alter analysis and output, and may include bugs. If re-analysis is required, use the same version of software from the original analyses where possible.
All relevant literature should be cited in the appropriate section and a reference list included at the end of the document. These documents can be found in the Doty Lab protocols notebook and/or the Doty Lab MS Teams Literature folder. For ease of use, it is recommended to import copies of the articles into your literature management software, e.g., Zotero.
1.1 How these protocols are structured:
Each section includes background notes, the required steps, and concludes with examples (blocked out) used from a prior experiment. The background notes and examples don’t need to be included in your notebook. The steps listed are the standard methods/activities and any deviations or revisions should be thoroughly noted.
Screen captures of the key steps are included for clarity and also don’t need to be included in your notebooks.
The code chunks are printed in this document (echo=T) as part of the protocol instructions, but would normally be hidden in the html document (echo= F).
These protocols assume that you are already somewhat familiar with basic computer commands, R Studio/Markdown, and our Doty Lab Digital Notebook formatting and style guides for file structure, naming, coding, etc.
It also assumes that you have installed and configured Geneious Prime (GP) per the instructions and completed the introductory tutorials.
1.2 Basic pipeline for assembly, annotation, and species identification.
- Create project/experiment directories and R Markdown notebook as needed.
- Obtain the raw seq. read files, either downloaded from sequencing lab or provided by a colleague.
- Import raw seq. reads into Geneious Prime (GP)
- Trim and Normalize raw seq. reads in GP
- Perform assembly in GP
- Filter contig sequences to \(\ge\) 1000 bp
- Export contig files and review assembly quality using CheckM
- Import contig files into RAST for annotation
- Download annotation files and import into GP
- Import contig files into TYGS for type-strain whole genome identification
- Summarize findings
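The directory-creation step above can be sketched in Python. This is only an illustration: the sub-directory names are taken from the example paths used later in this protocol (`~/rmd/data`, `/output/GP-assembly`, `~/output/RAST`), and a temporary directory stands in for your home directory.

```python
# Sketch of the project scaffold from the pipeline above. The sub-directory
# names follow the example paths used later in this protocol; adjust as needed.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())           # stand-in for your home directory
project = root / "Projects" / "Proj-DOE"  # example project from this document
for sub in ("rmd/data", "output/GP-assembly", "output/RAST"):
    (project / sub).mkdir(parents=True)

print([str(p.relative_to(project)) for p in sorted(project.rglob("*")) if p.is_dir()])
# → ['output', 'output/GP-assembly', 'output/RAST', 'rmd', 'rmd/data']
```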
You will need to start the document with the formatted YAML template and the setup code chunk. Neither of these will appear on the final document, but the YAML contains formatting and output instructions and the code chunk loads the necessary R packages to format data and tables.
Both of these topics are covered in the Digital Notebook protocol, and a template containing both the YAML and the setup code chunk can be found in our MS Teams Bioinformatic Protocols/scripts folder.
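For orientation only (the canonical template lives in the MS Teams folder; the title, author, and output fields below are placeholders), a minimal YAML header plus setup chunk might look like:

````
---
title: "de Novo assembly notebook"
author: "Your Name"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE) # hide code chunks in the rendered html
library(tidyverse)  # data formatting
library(kableExtra) # table formatting
```
````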
1.3 Basic Section Layout
Per our lab Digital Notebook protocols, each step should include the relevant information for each application in this order:
- Information about local software or online platform used
- Methods: What was done
- Results: Data produced
- Brief summary of results
- The submitted and output directory/files pathway
In detail:
- The information about the software/platform should include:
Name, version, and literature references.
- In the case of online platforms, include the website link, date accessed, and whether data is stored on their servers or downloaded (or both). For online data, note approximately how long it will be maintained on the platform's server.
- Methods: A brief description of what was done for each step. As with all methods, it should contain enough detail that another researcher could reproduce your steps. This could include:
- Type of files used (submitted); including directory and file names
- Software settings and/or options
- Deviations or revisions from standard protocols
Results: Typically a simple data table is produced of relevant output, or description of output.
Summary of findings: A brief summary of the results, providing sufficient information to write formal methods and results for publication. Note any interesting, unusual, or unexpected findings.
Document the directory pathway (where files originated or were stored) and the file name format for the files. Note: No need to list all of the file names, just the abbreviated (wildcard) format.
Note: Figures and summaries do not need to be publication quality. The information just needs to be accurate and clearly presented; sufficient for other researchers to produce formal output for publication.
2 de Novo assembly, annotation, and identification protocols
2.1 Project/Experiment Information
2.1.1 Introduction
Short introduction for the experiments in this notebook, referencing which project and giving a reader familiar with the project enough of the relevant details to understand what is contained in the notebook.
2.1.2 Collaborators
Names and contact information of collaborators (as needed). For smaller projects, this will be the lead researcher in our lab, while larger projects may include external collaborators.
Robert Tournay, Doty Lab UW, tournay@uw.edu
2.1.3 Summary of findings
This is the last step. When all analyses are completed you will write up the summary of findings in this section, with the Tables linked.
2.1.4 Location of data files.
Identify the location of all associated files. These should include the file pathways for:
- Local project files stored on your UW Google Cloud storage.
- The GitHub repository in the Doty Lab UW github account.
- Location of the raw data sequence files. These are archived on a UW Google account and are not included in the GitHub repository. Copies of the files can be saved in the local project folder for import to Geneious Prime for assembly.
The GitHub link can be found by going to the Project Repository, clicking the green Code button and copying the link.
If you pull a copy of the raw sequence read files into your local repository you will need to add it to the .gitignore file due to space limitations in GitHub.
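For example (the folder name and file pattern below are illustrative; match them to the name of your local raw-data copy), the .gitignore entries might look like:

```
# keep large raw sequence read files out of the GitHub repository
raw_data/
*.fastq.gz
```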
Example:
Local project files:
~/Projects/Proj-DOE
Github: https://github.com/Doty-Lab-UW/Proj-DOE.git
Raw data: ~/Bioinformatics/raw_data/DOE
2.1.5 DNA Isolation and Sequencing
Notes on the extraction of gDNA and library prep as appropriate. Because these steps are field and bench experiments, the detail can be minimal. If these steps were completed by other researchers, then that can be noted.
Example:
Collection, isolation, gDNA prep by Doty/Sher. Collection and isolation per Doty Lab protocols. DNA extractions using Nextera DNA Flex kit (Illumina, Cat. No. 20018705).
Information of lab that did the sequencing.
- Include name of lab (link to website)
- Kit names and IDs
- Sequencing lab’s project numbers
- How and when the raw data were obtained
Typically, this will be provided to you.
Example:
Library construction, amplification and sequencing by GENEWIZ, Inc.
- Reference Project #30_459016717
- MiSeq platform (300-Cycle V2 Reagent Kit)
- Sequences (fastq.gz) downloaded from Genewiz on 2021-06-17
2.1.6 Strains used in this study
Include list of strains assembled. Often, these have been identified using 16S rRNA. The 16S rRNA method is typically reliable to Genus, but revisions may occur when the identification is completed using the whole-genome sequence alignments here. The 16S rRNA method is less reliable for identification to the species level, although it should be included on the table when provided.
For one or a few strains, a bullet list works. For longer lists, a table can be built. Tables can be constructed in R Markdown or, more commonly, imported from excel or another document (.csv, .tsv, or similar simple format).
A table built in R Markdown:
| Strains | Genera | Species (if provided) |
|---|---|---|
| WW5 | Sphingomonas | sp. |
| SherDot1 | Azotobacter | beijerinckii |
| Root10 | Rahnella | aquatilis |
| NS | Paraburkholderia | sp. |
Screen shot of the code that produced the above table in R Markdown.
Alternatively, a table can be built quickly using an external document and the kableExtra and tidyverse packages.
###############
# There are many ways to import data and create tables in R Studio.
# These tables do not need to be "publication" quality, and so you don't need
# to spend a lot of time formatting. Just make them readable.
#
# I created a /data sub-directory in the /rmd directory and imported an
# excel .csv file containing the strain data.
#
# Recall that normally, code chunks are not included in report outputs. Here, I
# used the option echo=T to print the code chunk, but normally the call would be
# echo=F, to hide the code on the output document.
#
# Also, note that the "kableExtra" and "tidyverse" packages were previously
# loaded into the R environment.
#
# Below here is what I would actually include in my code chunk .
###############
# This command imports the data into R and saves it as the variable strains
strains <- read.csv("data/DOE_16S_sppID.csv")
# This command selects only the columns containing Strain and scientific name
strains <- strains %>%
select(Strain, Spp_16S_rRNA)
# Code for creating a captioned table, with minor formatting
kbl(strains, caption = "Strains included in this study") %>%
kable_styling(full_width = F) %>% # full_width=F for narrow tables
column_spec(2, italic = TRUE) # italicize scientific names (col 2)

| Strain | Spp_16S_rRNA |
|---|---|
| WW5 | Sphingomonas sp. |
| SherDot1 | Azotobacter beijerinckii |
| Root 10 | Rahnella aquatilis |
| NS | Paraburkholderia sp. |
2.2 de Novo Assemblies using Geneious Prime
There are many methods (both CLI and GUI) for de Novo assemblies. This protocol uses Geneious Prime (GP) software, a GUI platform that can perform various assembly methods, e.g., SPAdes and velvet, in addition to their own. There are strengths and weaknesses for each of the assembly methods (Forouzan et al. 2017). I have found that both GP and SPAdes provide high quality assemblies with good coverage. These protocols will describe the GP assembler, but it is good to know that there are other options available.
The general de novo assembly pipeline in GP is:
- Import paired-end raw seq. reads files into named folder on GP
- Trim the reads using BBDuk plug-in
- Normalize the trimmed read files using BBNorm plug-in
- Assemble the trimmed & normalized read files using the GP assembler
- Record the assembly data in a .csv file and create a table of relevant data
- Extract contigs \(\ge\) 1000 bp into a <strain>-Contig.fasta file for annotation
- Summarize results
- Export <strain>-Contig.fasta into the local project directory/folder and update version control on GitHub
In detail:
- Import paired-end raw seq. reads files into named folder on GP
This protocol assumes that you have already installed Geneious Prime and have taken the set-up tutorials. You should already have the plug-ins installed and know how to create project directories and folders.
- To import the reads, open Geneious Prime and open the folder that you have created and named for the strain.
- Navigate to the raw reads directory (folder) containing the paired raw sequence read files.
- Select both files (SHIFT - click) and drag them together into the folder in Geneious Prime. Alternatively, you can follow the pathway
File > Import > Files
A pop-up window will open and the options should reflect the sequencing technology used:
- Read Technology: Illumina
- Pair: checked, Paired End (inward pointing), pairs of files, with insert size 150
Note: Normally our lab uses Illumina paired-end reads, and the read length will either be 150 bp or 250 bp, typically. Once you set this it will default to your settings for future assemblies. The read length is provided in the report from the sequencing lab.
The paired-end reads will be merged into a single file, and the resulting GP file should look something like this:
To verify that the import was successful, look at the # Sequences and Max. Sequence Length columns. The data should match the QC report from the sequencing lab. Also, the % GC should be similar - although a slight variance is acceptable. The # Nucleotides should be approximately the product of the number of sequences and the max sequence length.
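This check is back-of-envelope arithmetic. A quick illustration in Python, using made-up values rather than numbers from a real run:

```python
# Sanity-check the imported reads against the sequencing lab's QC report.
# All values are illustrative placeholders.
n_sequences = 3_200_000      # "# Sequences" column in Geneious Prime
max_len = 251                # "Max. Sequence Length" column
n_nucleotides = 780_000_000  # "# Nucleotides" column

# Nucleotides can never exceed sequences * max length, and should be close to it
upper_bound = n_sequences * max_len
assert n_nucleotides <= upper_bound
print(f"{n_nucleotides / upper_bound:.1%} of the theoretical maximum")
# → 97.1% of the theoretical maximum
```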
It is highly recommended to rename the file to the strain name; right-click on the file name, and Edit name. You don’t need to edit the file description.
Notice that the file pathway, file name, and created date/time are recorded.
- Trim the reads using BBDuk plug-in
The data files contain millions of short (~250 bp) sequence reads that we are assembling into a draft genome. Due to the high number of reads, there is \(\gt\) 30x sequence coverage for each nucleotide. Each nucleotide on every read has a quality score (phred) that represents the certainty that the nucleotide is correct - and not the result of PCR or sequencing errors.
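A phred score Q corresponds to an error probability of \(10^{-Q/10}\), so a minimum quality of 30 (used in the BBDuk settings below) means at most a 1-in-1000 chance that a retained base call is wrong. A quick sketch:

```python
def phred_to_error_prob(q):
    """Probability that a base call with phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

print(phred_to_error_prob(30))  # → 0.001 (1 in 1,000)
print(phred_to_error_prob(20))  # → 0.01  (1 in 100)
```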
BBDuk trims the seq read files based on minimum phred scores and short lengths. These parameters are set by the user. To trim the reads, select your strain file in GP and then Annotate and Predict > Trim using bbDuk. A pop-up window will open, record the BBDuk version and complete the boxes and options like this:
- Trim adapters: unchecked
- Trim Low Quality: checked, Trim: Both Ends, and Minimum Quality 30
- Trim adapters based on paired read overhangs: Checked, Minimum Overlap: 20
- Discard Short Reads: checked, Minimum Length: 20 bp
- Trim Low Complexity (Entropy): unchecked
- Keep original order (slower, but ensures…): checked
- Max. Memory: default
- Custom BBDuk options: blank
The bolded parameters are reported and should be recorded in your notes.
This step should take a few minutes, depending on CPU power and available RAM.
GP creates a new file, adding (trimmed) to your strain name; the un-trimmed file is retained for comparison and reproducibility.
- In a second quality control step, the trimmed read file will be normalized for depth of coverage. As noted above, each nucleotide in the genome has been redundantly sequenced many times. The number of times each nucleotide position is sequenced is called the Depth of Coverage, or DOC. Most NGS technologies will target an average DOC of 30x - 90x. However, not all nucleotides were sequenced the same number of times. Some sequences are over-represented, and past a certain point the extra data doesn't improve results, but it does increase computation time.
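The average DOC is roughly (number of reads × read length) / genome size. A sketch with illustrative numbers, not values from a real run:

```python
# Back-of-envelope average depth of coverage. Values are illustrative.
n_reads = 2_000_000       # reads surviving trimming
read_len = 250            # bp per read
genome_size = 5_700_000   # ~5.7 Mbp, similar to the strains in this protocol

doc = n_reads * read_len / genome_size
print(f"~{doc:.0f}x average coverage")  # → ~88x average coverage
```

Normalization caps the over-represented regions at the user-set target (40x in the settings below).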
The BBNorm plug-in normalizes the DOC based on user set parameters. Select the strain (trimmed) file, and then Sequence > Error Correct and Normalize Reads.... A pop-up window will open, record the BBNorm version and complete the boxes and options like this:
- Error correction: unchecked
- Normalization: checked, Target Coverage Level: 40, Minimum depth: 6
The bolded parameters are reported and should be recorded in your notes.
This step should take a few minutes, depending on CPU power and available RAM.
GP creates a new file adding (trimmed)(normalized) to your strain name. Again, previous files are retained for comparison and reproducibility.
Note: If BBDuk and BBNorm make you feel like you are throwing away data, it is because you are. However, due to the high-quality sequencing and DOC of modern NGS technologies, you are eliminating noise that would increase your assembly time and may degrade your downstream results. If you want to convince yourself of this, you can run the assembly and compare downstream applications on all three versions.
- The de Novo assembler programs use complex algorithms, e.g., (Rizzi et al. 2019), to assemble the raw seq. reads (~ 250 bp) into contiguous sequences, called contigs, ranging from as long as hundreds of thousands of bp down to hundreds of bp in length; each contig representing a portion of the genome. Accordingly, draft genome assemblies are composed of a set of contigs rather than one continuous sequence representing the entire genome. However, this is sufficient for most downstream applications.
The quality of the draft assembly is affected by a range of factors, including the quality of upstream processes (e.g., DNA extraction, PCR amplification and sequencing) and the complexity of the organism’s genome. Accordingly, the GP assembler will produce a report that contains information on the quality of the assembly.
Select the <name of strain>(trimmed)(normalized) file and then Align/Assemble > De Novo Assemble... to open up the parameters window. Complete as follows:
Assembler: Geneious
Sensitivity: Medium-Low Sensitivity/Fast
Trim Before Assembly: Do not trim
Results:
- Assembly Name: default
- Save assembly report: checked
- Save list of unused reads: unchecked
- Save in sub-folder: checked
- Save contigs: checked, Maximum: 1000
- Save Consensus Sequences: checked, default options
Circularize contigs of \(\ge\) 3 sequences, if ends match: True
Produce Scaffolds: True
This is the assembly step and should take significantly longer (hours) depending on the CPU power, available RAM, and quality/complexity of the read files. You can reduce the time by only assembling a single strain at a time and closing other applications.
This step creates a sub-directory in GP called <name of strain>(trimmed)(normalized) Assembly that contains the assembly report (txt document), a consensus file, and the contig files. Depending on the assembly, there could be dozens or hundreds of contigs, although most will be short, low coverage contigs that can be filtered out without a loss of quality.
- In the Assembly directory, click on the Assembly Report file and record the following information in an excel spreadsheet or text document in a table format.
Strain and Genus
From the Contigs >= 1000 bp column:
- Length Sum (bp)
- Number of contigs
- Max Length (bp)
- N50 length (bp)
Sort the contig files by length, and record:
- The %GC and Mean Coverage (DOC) of the longest contig.
Sort the Topology by type and record the quantity (n) and length (bp) of circular sequences \(\ge\) 1000 bp. These are GP assembler predicted plasmids. Some assemblies may not have any, others may have several.
Save the table as a .csv file, and place it in the ~/rmd/data folder in your local project folder.
This image shows how the data look in excel and an R Studio text file.
And an example of the Table as shown in the report:
#This creates a variable to store the caption text, keeps code below cleaner
gp_cap <- "Geneious Prime Assembly Statistics"
# This command imports the data into R and saves it as the variable strains
gpStats <- read.csv("data/GP-stats.csv")
# Creates formatted table from data
kbl(gpStats, caption = gp_cap) %>%
kable_styling() %>%
column_spec(2, italic = TRUE)

| Strain | Genus | Genome_Mbp | Contigs | Max_Len | N50 | pctGC | meanDOC | Plasmids | P1 | P2 |
|---|---|---|---|---|---|---|---|---|---|---|
| WW5 | Sphingomonas | 5.76 | 118 | 252688 | 88302 | 64.00 | 40 | FALSE | NA | NA |
| SherDot1 | Azotobacter | 4.93 | 154 | 298403 | 75606 | 65.90 | 50 | TRUE | 77262 | NA |
| Root 10 | Rahnella | 5.71 | 46 | 870763 | 74700 | 52.11 | na | UNK | NA | NA |
| NS | Paraburkholderia | 8.05 | 84 | 853358 | 169632 | 62.50 | 38 | TRUE | 37673 | 13240 |
This table covers the basic QC parameters for genomic assemblies. Identifying good assemblies comes with practice, but generally you want to see:
Are the genome size and %GC reasonably comparable to other strains in the same Genus?
- A preliminary review can be done with a quick internet search for "<genus> AND genome size" or "<genus> AND %GC" to answer the question: are the results reasonable?
- For example, notice with this data set that the Paraburkholderia genome is large relative to the other strains. A quick internet search will confirm that a genome of ~8 Mbp is reasonable for this Genus. The same process can be followed with %GC. A more formal investigation can be done using the literature.
Fewer, longer contigs indicate a more complete assembly, while many short contigs could indicate assembly issues or noise in the data set. You can also look at the longest contigs to confirm whether they contain most of the sequence data.
N50 is analogous to the median contig length; longer is better.
DOC for the longest contigs should be better than 30x coverage, and higher is better. If you look at the contig files, you will notice the DOC drops significantly on the shorter contigs.
These are rough guides to the assembly quality, and what you are looking for are anomalies that stand out: an unreasonable genome size or %GC for the genus, a high number of short contigs, a short N50, etc. In the next step, we will do a further QC.
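For intuition about the N50 statistic, it can be computed in a few lines of Python: sort the contigs longest-first and walk down until half of the total assembly length is covered. The contig lengths below are a toy example, not real data:

```python
def n50(contig_lengths):
    """Length of the contig at which >= 50% of the total assembled bases are
    reached when contigs are taken longest-first."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length

# Toy assembly: 1000 bp total, so we need 500 bp; 400 + 300 crosses it
print(n50([400, 300, 150, 100, 50]))  # → 300
```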
- With that in mind, the next step is to extract the contigs \(\ge\) 1000 bp into a separate file, effectively filtering out the short contigs. The 1000 bp cut-off is arbitrary and some protocols retain contigs \(\ge\) 500 bp. Recall that there is redundancy in the data (DOC) and the shorter contigs are lower quality and have more noise - as reflected in the DOC and GC% - and this step filters out that noise.
To filter the contigs, select the Consensus Sequence file, the Lengths Tab, and then the Extract button. This will open a pop-up window where you can specify the Extract sequences with minimum length to 1000 bp. No other changes.
This will create a new Consensus Sequence file containing only the contigs that are \(\ge\) 1000 bp. Rename this to <strain>-ConSeq. This is the working draft assembly file.
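Geneious Prime does this filtering for you via the Extract button. Purely as an illustration of the cut-off logic (toy records, no real FASTA parsing), the same \(\ge\) 1000 bp filter in plain Python looks like:

```python
# Toy (header, sequence) records standing in for contigs in a FASTA file
contigs = [
    ("contig_1", "A" * 2500),
    ("contig_2", "A" * 999),   # just under the cut-off: dropped
    ("contig_3", "A" * 1000),  # exactly at the cut-off: kept
]

def filter_contigs(records, min_len=1000):
    """Keep only records whose sequence is at least min_len bp."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]

print([name for name, _ in filter_contigs(contigs)])  # → ['contig_1', 'contig_3']
```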
- Now you should write a brief summary of the results, stating your interpretation of the quality of the assemblies, whether they seem reasonable, and noting any problems, exceptions, or deviations from the standard protocols. It should have enough detail that your collaborators (or you, six months from now) can quickly scan the summary and interpret the results.
If there were no issues during assembly, this write up should be very brief; and will encompass only a sentence or two when submitted for publication.
- The final step is to export <strain>-ConSeq.fasta into the local project directory/folder and update version control on GitHub
Select the filtered file in GP, then Export > Export Documents and save the file to your Project folder under /output/GP-assembly/<strain>-ConSeq.fasta and then document the directories and files.
Example GP Assembly notes from another project:
Geneious Prime (v2021.2.2)
BBDuk and BBNorm (v38.84)
R version 4.0.2 (2020-06-22) – “Taken Off Again”
Paired end read files were imported and combined into a single file.
Reads were trimmed with BBDuk, with the minimum phred score set to 30 and reads < 20 bp discarded.
The trimmed read files were then normalized for coverage, with the target depth of coverage set to 40 and the minimum set to 6.
Trimmed and normalized read files were then assembled using the Geneious Prime assembler with the following settings:
- Assembler: Geneious
- Sensitivity: Medium-Low Sensitivity/Fast
- Trim Before Assembly: No
- Save Consensus Sequences: TRUE
- Circularize contigs of => 3 sequences, if ends match: True
- Produce Scaffolds: True
Assembled consensus sequences were filtered for contigs \(\ge\) 1000 bp for downstream applications.
Normally the assembly table shown above would be inserted here in the project notebook. Not included in this example for clarity.
###########
Four endophyte strains were assembled from paired-end raw sequence reads using the Geneious Prime software. The trimmed and normalized read files were assembled using the native Geneious Prime assembler. The draft genome-size and GC content for each strain appeared reasonable when compared to the whole genome-assemblies from members of the same genus.
Imported files:
~/raw_data/Apple-Bio/GENEWIZ-files/00_fastq/*.fastq.gz
- Where * is unique file identifier
Output to: ~/output/GP-assembly/<strain>-ConSeq.fasta
2.3 Quality of assembly with CheckM (KBASE)
An additional check of the assembly quality can be performed using the CheckM application to assess genome coverage and screen for sequence contamination.
This online application is part of the KBASE platform of the US Department of Energy. You won’t be loading software onto your computer, but accessing their servers remotely. If this is your first time accessing KBASE you will need to create an account. The accounts are free and can be created with your uw.edu email address.
After creating an account, you should see a dashboard that looks something like this:
On the left you will see my Narratives, or project directories in KBASE.
To start a narrative, click the green + New Narrative button in the upper right corner. That will take you to this screen:
If you look at the top, you will see your new narrative title (untitled), which can be changed by you.
You can see three main sections: the upper left, where your data will be uploaded and stored; the lower left, with different apps for the data; and the main section, which, for now, has links for documentation and tutorials.
The tutorials will show you how to get your data from your local Project directories into your newly created narrative.
You need to import your <strain>-ConSeq.fasta files from your assemblies. If you are working with multiple strains, you can upload all of them into the same narrative at the same time.
General steps:
- Import the consensus files (<strain>-ConSeq.fasta): load the data, define the data type (i.e., Assembly), and import them into your narrative.
- Run the CheckM application once for each file.
- Record and analyze the output data.
CheckM can be found in the Apps panel at the bottom left of your dashboard. Click on Genome Assembly, scroll down to CheckM, and click the app to open an instance in your main panel. From that instance you can run the analysis on a file by selecting it from the pull-down menu. If you have multiple strains to analyze, repeat this to open and run another instance of CheckM. Analysis will take less than 30 minutes, depending on server load.
There are no files to download with this app; actually, there are, but it is not worth the effort. We are only interested in two values:
- Completeness (should be \(\ge\) 95%)
- Contamination (should be \(\le\) 5%)
These numbers can be found on the CheckM TABLE tab on the report.
I just record them in my notebook. If the assemblies fail the CheckM metrics, that indicates problems with one of the prior steps; i.e., the DNA extraction, PCR amplification, sequencing, or assembly steps, and further analysis is necessary.
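The pass/fail decision reduces to two comparisons. A sketch using the thresholds above, with illustrative completeness/contamination values:

```python
def checkm_pass(completeness_pct, contamination_pct):
    """Apply this protocol's CheckM thresholds: >= 95% complete, <= 5% contaminated."""
    return completeness_pct >= 95.0 and contamination_pct <= 5.0

print(checkm_pass(99.2, 0.6))  # → True: assembly passes QC
print(checkm_pass(91.0, 0.6))  # → False: revisit the prior steps
```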
Example:
KBASE
Date accessed: 2021-10-11
CheckM version 1.0.18
Reference: (Parks et al. 2015)

The GP ConSeq assemblies, filtered \(\ge\) 1000 bp, were analyzed in CheckM for the quality of the assembly; i.e., completeness of coverage and contamination. Thresholds for high quality assemblies are \(\ge\) 95% completeness and \(\le\) 5% contamination.
All strains were found to have \(\gt\) 99% coverage with \(\lt\) 0.7% contamination (data not shown, but recorded in checkM-res.csv).

Submitted files: ~/output/GP-ConSeq/<strain>-ConSeq.fasta
Output directory/files: ~/r-nb/data/checkM.csv
You have now assembled a high-quality draft genome that is ready for downstream analyses. There are other protocols to further refine the genome, such as mapping the contigs to a reference genome, but this stage is sufficient for our needs.
2.4 Annotation of assembled draft genomes
Once the assemblies are complete and meet QC criteria, the next step is annotation. Annotation is the process of identifying the open reading frames (ORFs) and coding sequences (CDS). From this information, annotation protocols attempt to predict the protein-encoding, rRNA, and tRNA genes, and assign functions to those previously described. Using that information, different subsystems and pathways can be investigated.
This process depends on the quality of the assembly, the algorithms used, and how well the genes and/or species have been studied. For example, a high-quality assembly of a well described pathogen, such as Erwinia amylovora or Pseudomonas aeruginosa, will likely be more fully annotated than a novel strain of a less well studied species, like Rahnella aceris. Similarly, well studied genes are more likely to be correctly annotated, even if they occur in a novel strain.
Previously, annotations were completed manually, a laborious and time-consuming effort, requiring extensive expertise. Now, draft annotations are done computationally using various algorithms to identify candidate genes and results are generated in a fraction of the time. Just as with the assembly, there are a variety of different annotation protocols. In fact, it is not unusual to have a draft genome annotated by several programs to cover more of the genome.
We are going to start with a common annotation pipeline called the Rapid Annotation using Subsystem Technology (RAST). You will need your own account (free), and for me the process took several days - so you will want to plan ahead.
Once you have the account, the process is straightforward. You will begin by initiating a new job.
On the next screen, attach and upload your <strain>-ConSeq.fasta file and submit. Next, complete the strain metadata:
- Taxonomy ID: blank
- Taxonomy string: blank
- Domain: bacteria
- Genus: input genus
- Species: sp.
- Strain: strainID
- Genetic code: 11 (Archaea, most Bacteria…)
and finally the RAST parameters:
- RAST annotation scheme: RASTtk
- Customize RASTtk pipeline: unchecked
- Automatically fix errors: checked
- Fix frame shifts: checked
- Build metabolic model: checked
- Turn on debug: checked
- Set verbose level: 0
- Disable replication: unchecked
Then submit the job. You will receive an email when it is complete.
Jobs can be found in Your Jobs > Jobs Overview. The annotation ID is the reference number that will appear on future downstream analyses, and you can click view details to open the next screen.
On the next screen you will need to download three files to the ~/output/RAST directory in your project folder:
- Amino-Acid FASTA file, as <strain>-aaRAST
- GenBank, as <strain>-gbRAST
- Spreadsheet (tab-separated text format), as <strain>-txtRAST
These files contain basically the same information, but in different formats, and will have the extensions .fasta, .gbk, and .txt, respectively.
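These flat files lend themselves to quick scripted sanity checks before moving on. A minimal Python sketch of tallying feature types from the tab-separated spreadsheet; the column names and `peg`/`rna` type labels here are assumptions based on typical RAST exports, so verify them against your actual download:

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical excerpt of a <strain>-txtRAST.txt spreadsheet; real RAST
# exports have more columns (contig_id, start, stop, function, ...).
sample = (
    "feature_id\ttype\tfunction\n"
    "fig|1.1.peg.1\tpeg\thypothetical protein\n"
    "fig|1.1.peg.2\tpeg\tDNA gyrase subunit A\n"
    "fig|1.1.rna.1\trna\ttRNA-Ala\n"
)

def count_feature_types(handle):
    """Tally the 'type' column of a RAST tab-separated spreadsheet."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["type"] for row in reader)

counts = count_feature_types(StringIO(sample))
print(counts["peg"], counts["rna"])  # 2 1
```

For a real file, replace `StringIO(sample)` with `open("<strain>-txtRAST.txt")`; the counts should roughly match the CDS and RNA numbers reported by the SEED Viewer.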
Finally, follow the Browse annotated genome in SEED Viewer link to record:
- Number of subsystems
- Number of Coding Sequences
- Number of RNAs
For one or a few genomes, it is easiest to just record directly into the notebook, while for many strains a table may be more appropriate.
Ex: > Strain 3YPLD (Pantoea), Subsystems: 355, Coding seq.: 5094, RNAs: 88
The SEED viewer presents the annotation data in graphic and tabular formats, and you can search for genes and subsystems (e.g. protein secretion systems), do comparative analyses, view diagrams, and much more. At this point we are collecting data and will return to look at these features later, but feel free to explore.
You should have three files saved and a few data points recorded.
Example:
Rapid Annotation using subsystem technology (RAST)
RAST version 2.0, upload date: 2021-08-02
Data stored indefinitely (not specified) in user’s page
Reference: (Aziz et al. 2008)
Consensus sequence files (filtered for contigs \(\ge\) 1000 bp) were submitted for annotation.
The following options were selected:
- RAST annotation scheme: RASTtk
- Customize RASTtk pipeline: unchecked
- Automatically fix errors: checked
- Fix frame shifts: checked
- Build metabolic model: checked
- Turn on debug: checked
- Set verbose level: 0
- Disable replication: unchecked
| Strain | CDS | Subsystems | RNAs |
|---|---|---|---|
| 1SSLD | 4768 | 337 | 89 |
| 2ALA1 | 5286 | 379 | 72 |
| 2PtLD | 4978 | 376 | 97 |
| 2RDLD | 5139 | 372 | 106 |
| 3RS3 | 5413 | 356 | 82 |
| 3ThS2 | 6273 | 379 | 62 |
| 3WL2 | 3237 | 297 | 72 |
| 3YPLB | 5646 | 369 | 74 |
| 4RDLA | 4847 | 355 | 81 |
| 4RLE | 5524 | 360 | 92 |
No summary.
Submitted files:
~/output/GP-ConSeq/<strain>-ConSeq.fasta
Output directory: ~/output/RAST/
Output files:
- <strain>-annRAST.faa: the amino acid sequences by PEGs
- <strain>-annRAST.gbk: GenBank format
- <strain>-annRAST.txt: full file, incl. na, aa, protein, PEGs
2.5 Identification
The final step in this tutorial is the identification of the strains. Traditionally, identification is done by sequencing the 16S ribosomal RNA gene. The 16S rRNA gene (~1500 bp) is highly conserved due to its essential role in translation, is present in all bacteria, and is fast and easy to target and amplify; in addition, there exists a very large database of bacterial 16S rRNA sequences for phylogenetic analysis.
However, because of the dynamic nature of bacterial genomes, and the fact that many species harbor multiple distinct 16S rRNA genes, this method of identification is considered reliable only to the genus level.
See: (Janda and Abbott 2007), particularly Issues section.
That does not mean that 16S rRNA sequencing is not useful. It is still widely used as a quick, inexpensive first-step identification when working with a new or unknown isolate. It is also widely used in microbiome analysis of mixed spp. samples.
Instead, we are going to use an identification method that utilizes the whole-genome sequences of our strains. The assembly data will be aligned against a database containing type-strain genomes. Type strains, or reference strains, represent the original isolates used in species and subspecies descriptions for taxonomic classification, and exhibit all of the relevant phenotypic and genotypic properties for the given species.
We are going to upload the assemblies to the Type (Strain) Genome Server (TYGS), which contains a comprehensive database, \(\gt\) 15k species and subspecies, of validly published type strains for whole-genome based methods for phylogeny and taxonomic classification. Like the previous steps, there are a variety of protocols for identification of species, using different methods.
Similar to the RAST annotation step, you will upload your annotated assembly to the TYGS server, which will complete the analysis and prepare data files for download.
Under the Submit your Job link, upload your annotated assembly (the RAST file); you can upload either the <strain>-aaRAST.faa (amino-acid) or the <strain>-gbRAST.gbk file. Leave everything else at default (blank), enter your email, and submit. You will receive an email when the job is complete with a link to the results. The processing will take a few hours to a few days, depending on server load.
TYGS infers species/subspecies taxonomic classification and phylogenies using digital DNA:DNA hybridization (dDDH) of whole genomes against the type-strain. A dDDH d4 score of >70% is the threshold for classification to the species level. You can look at the information icon for more information on the dDDH categories.
Table 1 contains the phylogeny results. You can view phylogenetic trees based on both the 16S rRNA sequences and the WG sequences.
Table 2 shows the results of the identification. It will report the strain as identified as X species, or as a “Potentially new species.”
Table 3 shows the dDDH scores with confidence intervals and %GC difference for the closest type-strain matches. The d4 score, \(\ge\) 70%, is used for identification.
Table 4 is the reference literature for compared type-strain species.
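When screening many strains, the d4 decision rule above is simple enough to script. A minimal Python sketch, assuming (strain, closest type strain, d4) values transcribed from Table 3; the function name and tuple layout are illustrative, not part of the TYGS output:

```python
def classify(strain, type_strain, d4, threshold=70.0):
    """Apply the dDDH d4 >= 70% rule for species-level identification."""
    if d4 >= threshold:
        return f"{strain}: same species as {type_strain} (d4 = {d4})"
    return f"{strain}: genus-level match only, potentially new species (d4 = {d4})"

# Example values from the results table recorded later in this protocol
for row in [("2RDLD", "S. marcescens", 92.1), ("3RS3", "P. endophytica", 53.3)]:
    print(classify(*row))
```

The same rule could be applied over the downloaded <strain>-dDDH.csv files with Python's csv module, once you have confirmed the column names in the export.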
There are five files to download. Save them in the ~/output/TYGS folder in the project drive.
- Table 1: download both “Newick” files for the strain. Save as <strain>-16Sphy.txt and <strain>-WGphy.txt.
- Table 3: download a .csv formatted file of the dDDH scores: <strain>-dDDH.csv
- Table 4: download a .csv formatted file of the reference literature: <strain>-ref.csv
- Top of the screen: Download PDF Report to save a PDF version of the report: <strain>-TYGS. Besides the summaries, this file contains methods and results sections that can be used/modified for formal reports.
Note: Newick is a standard format that can be imported into a variety of apps (including Geneious Prime) to produce graphic versions of the phylogenetic trees.
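As a quick illustration of the format, the taxon names can be pulled out of a simple Newick string with a few lines of Python. This is a sketch that assumes unquoted leaf labels and unlabeled internal nodes (the tree below is hypothetical); a dedicated tree library handles the full format, including quoted labels and comments:

```python
import re

def leaf_names(newick):
    """Extract leaf (taxon) labels from a simple Newick string.

    Assumes unquoted labels and unlabeled internal nodes; a label is
    any run of characters after '(' or ',' up to a ':' or delimiter.
    """
    return re.findall(r"[(,]\s*([^(),:;]+)", newick)

# Hypothetical three-taxon tree in Newick format
tree = "((Pantoea_sp_3YPLD:0.01,Pantoea_endophytica:0.02):0.05,Erwinia_oleae:0.08);"
print(leaf_names(tree))  # ['Pantoea_sp_3YPLD', 'Pantoea_endophytica', 'Erwinia_oleae']
```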
Example:
Type (Strain) Genome Server (TYGS)
TYGS v321, upload date: 2021-08-02
Data stored for 60 days on site under unique ID, downloadable
Reference: (Meier-Kolthoff et al. 2013)
The annotated assemblies were submitted to the Type (Strain) Genome Server (TYGS) for identification.
TYGS infers species/subspecies taxonomic classification and phylogenies using digital DNA:DNA hybridization (dDDH) of whole genomes against a type-strain database of \(\gt\) 15k spp.
A dDDH d4 score of >70% is the threshold for classification to the species level.
| Strain | Species | d4 | StdDev | Closest.Type.Strain | Type.Strain.ID |
|---|---|---|---|---|---|
| 1SSLD | Erwinia sp. | 22.5 | +/-2.5 | E. oleae | DAPP-PG-531 |
| 2ALA1 | Pseudomonas sp. | 47.6 | +/-2.6 | P. wadenswilerensis | CCOS 864 |
| 2PtLD | Serratia sp. | 63.8 | +/-2.9 | S. nematodiphila | DSM 21420 |
| 2RDLD | Serratia marcescens | 92.1 | +/-1.7 | S. marcescens subsp. sakuensis | KCTC 42172 |
| 3RS3 | Pantoea sp. | 53.3 | +/-2.7 | P. endophytica | 596 |
| 3ThS2 | Pseudomonas sp. | 39.9 | +/-2.6 | P. antarctica | LMG22709 |
| 3WL2 | Acinetobacter soli | 88.7 | +/-2.1 | A. soli | KCTC 22184 |
| 3YPLB | Pseudomonas koreensis | 80.1 | +/-2.6 | P. koreensis | JCM 14769 |
| 3YPLD | Pantoea sp. | 53.4 | +/-2.7 | P. endophytica | 596 |
| 4RDLA | Erwinia sp. | 22.0 | +/-2.4 | E. aphidicola | JCM 21238 |
| 4RLE | Pantoea agglomerans | 71.7 | +/-2.8 | P. agglomerans | NBRC 102470 |
Summary: Four of the strains, 2RDLD, 3WL2, 3YPLB, and 4RLE, were identified to the species level, while the remaining seven strains, 1SSLD, 2ALA1, 2PtLD, 3RS3, 3ThS2, 3YPLD, and 4RDLA, were only matched to the genus level, indicating that they may represent novel species.
Output directory:
~/output/TYGS
Output files:
- <strain>-TYGS.pdf: Full report
- <strain>-16Sphy.svg: Phylogenetic tree from 16S rRNA alignments
- <strain>-WGphy.svg: Phylogenetic tree from WGS alignments
3 Final Notes
That completes the assembly, annotation, and identification protocols for a newly isolated strain. Next will be the genetic characterization, where we will do comparative analyses against related strains and mine the genomes for genes/systems of interest, which will be considerably more challenging and interesting.
Remember to document what you do. Like all protocols, these are the standard steps, and any modification or deviation should be thoroughly documented. Also, recall that your notebooks are legal documents which may get reviewed in the future; this could be by supervisors, colleagues, collaborators, reviewers, or other scientists pre- and post-publication.
Original raw data should not be altered. Any changes, formatting, or additions should be done in R Studio. This includes Excel spreadsheets, downloaded files, etc., and is especially important for data provided to you by other people or sources.
It is particularly important to record all software and application versions and the dates accessed. If you submit data over several dates, or resubmit data, then note that in the dates accessed section. Include what was submitted when, and if resubmitted why it was done.
Save all files carefully and in their correct folder. Stick with the naming conventions as much as possible. This provides clarity for future researchers (and yourself in 6 months) that may be revisiting the notebook.
Remembering to download data files is particularly important because many of the web platforms we use only retain the data for a limited period. This is a bit of a pain, but a fair trade for accessing the applications free of charge. Some of the graphical output is reproducible from the downloaded data files (e.g. annotations, phylogenetic trees), but in some cases it is important to download copies of any graphics you want to have access to in the future.
Finally, relax! One of the benefits of bioinformatics research is that you can’t ruin the experiment! Unlike forgetting to water the plants or using the wrong culture media, archived copies and good reproducibility practices keep the original data intact. You may not get the correct results, or you may corrupt your local copy of the data, but you can always go back one or more steps if needed. This gives researchers the ability to explore different or redundant methods for the same data set; this is roughly equivalent to adding experimental treatments, or having multiple lines of evidence to support your hypotheses.